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Abstract 

Background: Environmental shotgun sequencing (metagenomics) provides a new way to study communities in 
microbial ecology. We here use sequence data from the Global Ocean Sampling (GOS) expedition to investigate 
toxicant selection pressures revealed by the presence of detoxification genes in marine bacteria. To capture a 
broad range of potential toxicants we selected detoxification protein families representing systems protecting 
microorganisms from a variety of stressors, such as metals, organic compounds, antibiotics and oxygen radicals. 

Results: Using a bioinformatics procedure based on comparative analysis to finished bacterial genomes we found 
that the amount of detoxification genes present in marine microorganisms seems surprisingly small. The 
underrepresentation is particularly evident for toxicant transporters and proteins involved in detoxifying metals. 
Exceptions are enzymes involved in oxidative stress defense where peroxidase enzymes are more abundant in 
marine bacteria compared to bacteria in general. In contrast, catalases are almost completely absent from the 
open ocean environment, suggesting that peroxidases and peroxiredoxins constitute a core line of defense against 
reactive oxygen species (ROS) in the marine milieu. 

Conclusions: We found no indication that detoxification systems would be generally more abundant close to the 
coast compared to the open ocean. On the contrary, for several of the protein families that displayed a significant 
geographical distribution, like peroxidase, penicillin binding transpeptidase and divalent ion transport protein, the 
open ocean samples showed the highest abundance. Along the same lines, the abundance of most detoxification 
proteins did not increase with estimated pollution. The low level of detoxification systems in marine bacteria 
indicate that the majority of marine bacteria have a low capacity to adapt to increased pollution. Our study 
exemplifies the use of metagenomics data in ecotoxicology, and in particular how anthropogenic consequences on 
life in the sea can be examined. 
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Background 

Prokaryotes have evolved a rich repertoire of detoxifica- 
tion mechanisms that enable them to exploit and be 
successful in highly diverse ecological niches. By this 
repertoire bacteria impact the environment since micro- 
bial communities alter bioavailability of toxic metals, 
such as arsenic and selenium [1], and degrade a large 
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array of xenobiotic compounds [2]. In habitats under 
anthropogenic influence, where a large variety of toxic 
substances can be present, appropriate detoxification is 
believed to be instrumental for survival and growth. 
Thus, detoxification proteins are likely to play a major 
part in natural ecosystem processes, and in a broader 
perspective also contribute to human health and society. 
Despite their importance, there is a lack in our under- 
standing of the distribution and use of these detoxifica- 
tion systems in natural environments, like the ocean. 
Metagenomics, i.e. the sequencing of DNA isolated from 
environmental samples, provides a new way to study 
communities in microbial ecology, and enables studies 
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of detoxification systems in natural environments in 
an unbiased way and on a genome-wide scale. The 
gene-centered approach to ecotoxicology will become 
increasingly important since sequencing of microbial 
communities can provide relationships between de- 
toxification systems and selective pressures in specific 
environments, indicating how various selection pres- 
sures shape the gene content of microbial organisms. 

Metagenomics fosters studies of the unknown diver- 
sity, as most of the microbes in nature cannot be 
cultured in the laboratory [3]. This novel strategy has 
successfully been applied to assess a variety of biological 
questions, revealing, for example, a wide extension of 
the bacterial kinome [4], functional novelties in light- 
mediated pathways [5], the distribution of photosynthetic 
light-harvesting genes in phytoplankton communities [6], 
and shedding light on the ecological roles of proteins with 
unknown functions [7]. However, metagenomics also pro- 
vides a means to assess how particular selection pressures 
affect the gene content of a certain environment and, con- 
versely, what selection pressures that are disclosed by the 
gene content in an environmental sample. For example, 
methane consumption has been correlated to the presence 
of methane degrading enzymes, like methane mono- 
oxygenases, by metagenomics on thawing samples from the 
permafrost [8]. Also, in a highly contaminated river in 
India, large numbers of resistance genes could be found as 
a consequence of up-stream antibiotic pollution from 
factories [9]. Furthermore, in ground water highly con- 
taminated with heavy metals, nitric acid and organic 
solvents, enhanced abundance of resistance genes 
towards e.g. nitrate, cadmium and acetone has been 
reported [10], and quaternary ammonium compound 
exposure can cause enrichment of efflux pumps and 
cell envelope modification systems in microbial com- 
munities [11]. Thus, screening for genes involved in 
the handling of xenobiotics in environmental sequence 
data could provide further understanding of how 
microbes cope with high levels of toxicants, and aid 
our search for biotechnologically important detoxifica- 
tion genes. In this work, we have used sequence data 
from the Global Ocean Sampling (GOS) expedition to 
investigate toxicant selection pressures revealed by the 
presence of functionally characterized detoxification 
genes in marine environments. The GOS data still 
constitutes the largest ocean study performed over a 
geographically wide area in a consistent manner. 

Protein sequences can be grouped by functional criteria 
and there are currently around 15,000 - 17,000 classified 
protein families [12,13]. Here we systematically analyze a 
subset of those that are linked to detoxification. We thereby 
provide evidence for the extent to which toxicants in the 
marine milieu affect microorganisms and whether the 
genes present indicate differences in toxicant selection 



pressures between marine environments, like the open 
ocean compared to coastal waters. To be able to efficiently 
capture the broad range of potential toxicants, and to ad- 
dress differences between various sampling sites, we have 
selected well-characterized detoxification protein families 
representing biological systems protecting microorganisms 
from a variety of stressors, such as metals, organic com- 
pounds, antibiotics and oxygen radicals. We find that the 
amount of detoxification genes present in marine microor- 
ganisms seems to be surprisingly small. This is particularly 
evident for toxicant transporters, as well as for protein fam- 
ilies detoxifying metals. Exceptions are enzymes involved in 
the oxidative stress defense, where we found that peroxid- 
ase enzymes are more abundant than expected. In contrast, 
catalases are almost completely absent from the marine en- 
vironment, suggesting that peroxidases and peroxiredoxins 
constitute a core line of defense against reactive oxygen 
species (ROS) in this milieu. 

Results 

Selection of well-characterized bacterial systems directly 
linked to detoxification 

To enable a wide and unbiased detoxification scope in 
our metagenomics analysis we initially selected all pro- 
teins in the NCBI database related to detoxification 
mechanisms, based on their Gene Ontology (GO) classi- 
fications [14]. The term detoxification itself is not part of 
the GO terminology, and thus we used nine rather wide 
GO terms with links to detoxification, e.g. response to toxin 
(0009636), response to oxidative stress (0006979) and re- 
sponse to xenobiotic stimulus (0009410) (see Additional 
file 1: Table SI for the complete list of GO terms used). In 
addition, we initially restricted our analysis to functionally 
well-characterized detoxification systems encountered in 
the model bacterium Escherichia coli for the following rea- 
sons; i) to focus on detoxification systems in bacteria 
(which is most adequate for the GOS data), ii) to avoid any 
misclassification and/or uncertainties about the proteins' 
functional roles, and iii) to obtain a wide set of detoxifica- 
tion genes, since E. coli has a comparably large genome and 
thus harbors a rather wide array of detoxification systems. 
All obtained detoxification-linked protein sequences were 
matched against the Pfam database of profile-HMMs using 
HMMER [15], which resulted in a list of 159 Pfam protein 
profiles (Additional file 2: Table S2). The use of Pfam 
profile-HMMs representing the protein families in our 
searches of the metagenomic data allows us to find protein 
sequences from a broad range of species, due to the Hidden 
Markov Models' high sensitivity with regards to amino acid 
changes also over long evolutionary distances [16-18]. We 
finally manually removed protein families solely indirectly 
classified as linked to detoxification e.g. ribosomal proteins 
and tRNA synthetases. In this way, we reduced the list to 
31 strictly detoxification-related protein families with 
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profile-HMMs, mainly representing metal resistance, toxin 
transporters and oxidative stress protection (Table 1). 

Initial analysis of the abundance of detoxification systems 
in the Global Ocean Sampling (GOS) data 

The 31 Pfam profile-HMMs for detoxification proteins 
were compared to the 6,028,191 proteins in the GOS 
data set [19] using hmmsearch, which is part of the 
HMMER3 software [15]. The resulting number of se- 
quences found in this initial search is indicated under 
"GOS Initial" in Table 1. The most abundant (in terms 
of the number of genomic reads) of all the detoxification 
proteins were the major facilitator family of transport 
proteins (22,627 reads) and peroxiredoxins, involved in 
oxidative stress defense (14,009 reads). 

To rule out uncertainties in protein classifications, 
matches were in a second step scanned against the en- 
tire Pfam profile database and only reciprocal hits, i.e. 
sequences that best matched the same Pfam profile as 
initially found them, were further analyzed ("GOS Recip- 
rocal" in Table 1). Note that in certain cases the GOS 
initial and the GOS reciprocal values differ quite sub- 
stantially. This is particularly pronounced for the perox- 
iredoxin, thioredoxin and redoxin families where the 
reciprocal matches were below 60% of their initial num- 
bers. These larger discrepancies are almost exclusively 
due to hits to protein families belonging to the same 
Pfam clan and therefore showing fairly high sequence 
similarity, which is known to be the case for e.g. the per- 
oxiredoxin/thioredoxin proteins [20]. For the reciprocal 
matches, the major facilitator family and the ACR trans- 
porters were the most abundant protein classes, followed 
by penicillin binding transpeptidase and peroxiredoxin. 

Marine bacteria harbor a low number of detoxification 
systems 

The abundances of various detoxification systems in an 
environmental sample may tell something about the se- 
lection pressure in that particular environment. The stat- 
istical approaches proposed for analyzing these data 
usually test the null hypothesis that the abundances of 
genes are equal across samples [21-23]. However, some 
genes are frequently present as larger paralogous families 
in microbes, and thus blindly applying the reciprocal 
GOS hits indicated in Table 1 would result in a mis- 
interpretation of the selection pressure for certain 
detoxification systems in that particular environment. 
To circumvent this bias in our analyses we employed a 
procedure where we normalize for the extent that 
detoxification systems are generally found in typical bac- 
terial genomes. 

In order to enable this type of genome content 
normalization, we used the same 31 Pfam profiles and 
the reciprocal search procedure to determine the 



number of genes belonging to our detoxification set in 
835 fully sequenced and annotated bacterial genomes 
(Additional file 3). It is clear from this analysis that the 
number of genes corresponding to each detoxification 
protein family, in each bacterial genome, exhibits ra- 
ther great variation between bacterial species (Table 1; 
Additional file 4: Table S3; Additional file 5: Figure SI). 
The high sensitivity of the Pfam profiles was clearly 
apparent since we picked up homologous proteins at 
very wide evolutionary distances, e.g. the divalent ion 
tolerance protein CutAl was found in the archaeon 
Methanococcus maripaludis as well in the eubacteria 
Burkholderia mallei and Prochlorococcus marinus. The 
majority of bacterial species contain a wide spectrum 
of detoxification systems, where classes like major 
facilitator transporters (genome average 23.1 genes), 
ACR transporters (genome average 4.4 genes) and 
thioredoxin (genome average 2.7 genes) are present in 
multiple copies in almost all genomes. However, it was 
also clear that some species apparently lack even these 
common detoxification systems, e.g. the parasitic 
microbes Chlamydia trachomatis and Mycobacterium 
tuberculosis, and the coefficients of variation (CV) 
were rather large (above 100%) for many of the protein 
families. This was most apparent for genes where the 
average was below 1 and where many genomes in fact 
lack the gene completely, like copper resistance pro- 
tein CopC and tellurite resistance protein TehB which 
both exhibited CVs that were greater than 200% and 
with averages around 0.2-0.3 genes per genome. In 
fact, the majority of detoxification families had average 
and median values of gene copies below 1, revealing 
that these genes are missing in many genomes. The 
overall correlation between the average and median 
values was high (R 2 = 0.96; Additional file 6: Figure S2) and 
all analyses below have been performed using either of the 
two, giving similar results (data not shown). 

When we exclusively examined the genomes of marine 
bacteria (61 species/strains fully sequenced; Figure 1), 
we found that the most common marine bacteria in the 
GOS data (based on read recruitment to genome refer- 
ence sequences [24]), like Prochlorococcus and Synecho- 
coccus, have substantially fewer detoxification genes than 
the less common marine species. Two exceptions were 
the glutathione peroxidase and peroxiredoxin families, 
of which these marine bacteria have higher numbers 
than the average across all sequenced bacteria in our set 
(Additional file 4: Table S3). Furthermore, the stream- 
lined genome of the very common marine bacteria 
Candidatus Pelagibacter ubique [25], contained few of 
the investigated detoxification genes, as would be ex- 
pected from its small genome size. Surprisingly however, 
the number of detoxification systems in the larger ge- 
nomes of Prochlorococcus and Synechococcus were also 
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Table 1 Detoxification protein families investigated in this work, divided into functional categories 



Protein family 



Pfam name 



Number of 
reads GOS 
Initial 



Number of 
reads GOS 
Reciprocal 



GOS 

Reciprocal 



Genomes average 
(genes/genome) 



Genomes 
CV (%) 



Metal resistance 

Arsenate reductase 
Divalent ion tolerance protein 
Copper resistance protein CopC 
Multicopper oxidase 
Copper transporter 
Tellurite resistance protein 
Arsenical pump membrane protein 
Cadmium binding protein 



ArsC 

CutA1 

CopC 

Cu-oxidase_3 

CutC 

TehB 

ArsB 

YodA 



993 
511 
472 
435 
258 
2956 
177 
1 



853 
511 
426 
313 
258 
244 
19 
0 



85.9 
100 
90.3 

72 
100 

8.3 
10.7 
0 



1.1 
0.4 
0.3 
0.7 
0.2 
0.2 
0.3 
0.0 



127 
206 
154 
182 
206 
218 
492 



Transporters 

Major facilitator family 
ACR transporter 

Multi antimicrobial extrusion protein 
Efflux pump 

Small Multidrug Resistance protein 
C4-dicarboxylate transporter 



MFS_1 
ACR_tran 
MatE 
HlyD 

Multi_Drug_Res 
C4dic_mal_tran 



22627 
10591 
5164 
4048 
2194 
71 



19301 
10337 
4308 
2693 
2097 
71 



85.3 
97.6 
83.4 
66.5 
95.6 
100 



23.1 
4.4 
2.2 
6.9 
1.0 
0.3 



108 
126 
120 
121 
134 
173 



Oxidative stress 

Peroxiredoxin 
Thioredoxin 
Redoxin 
Peroxidase 

Glutathione peroxidase 
Superoxide dismutase 
Catalase 



AhpC-TSA 

Thioredoxin 

Redoxin 

peroxidase 

GSHPx 

Sod_Fe_C 

Catalase 



14009 
8221 
13178 
3067 
2380 
935 
91 



7397 
4668 
4410 
3065 
2218 



52.8 
56.8 
33.5 
99.9 
93.2 
85.1 
96.7 



3.5 
2.7 
1.7 
0.4 
0.6 
0.9 
0.8 



91 
67 
108 
154 
131 
78 
146 



Other detoxification systems 

Bacitracin resistance protein BacA 

Organic solvent tolerance protein OstA_C 

Fusaric acid resistance protein FUSC 

Di-haem cytochrome c peroxidase CCP_MauG 

Beta-lactamase Beta-lactamase 

NADH oxidase Oxidored_FMN 

Penicillin binding transpeptidase Transpeptidase 

Multiple antibiotic resistance operon repressor MarR 

MAATS-type multidrug transcriptional repressor TetR_C_2 

NADPH-dependent FMN reductase FMN_red 



2803 
1094 

67 
606 
5196 
2598 
9702 
2949 

21 
2222 



2799 
1086 
65 
582 
4985 
2349 
8379 
2078 
12 
1076 



99.9 
99.3 
97 
96 
95.9 
90.4 
86.4 
70.5 
57.1 
48.4 



0.8 
0.4 
0.8 
0.5 
2.6 
1.9 
2.1 
5.8 
0.1 
2.5 



73 
121 
205 
211 
149 
130 

72 
128 
343 
110 



Control protein families 

Elongation factor TS 
DNA polymerase A 
RNA polymerase rpoB 



EF_TS 

DNA_pol_A 
RNA_pol_Rpb2_ 



2780 
5372 
5847 



2734 
4973 
5106 



98.3 
92.6 
87.3 



0.8 
0.9 
1.0 



49 
55 
9 
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Thermotoga sp. RQ2 (126740) 

Synechococcus elongatus PCC 7942 (1140) 

Thermosipho africanus TCF52B (484019) 

Petrotoga mobilis SJ 95 (403833) 

Geobacillus kaustophilus HTA426 (235909) 

Candidatus Pelagibacter ubique HTCC1062 (335992) 

Prochlorococcus marinus str. MIT 9313 (74547) 

Prochlorococcus marinus str. MIT 9303 (59922) 

Synechococcus sp. CC9311 (64471) 

Synechococcus sp. WH 8102 (84588) 

Synechococcus sp. CC9902 (316279) 

Synechococcus sp. CC9605 (110662) 

Synechococcus sp. WH 7803 (32051) 

Prochlorococcus marinus str. NATL1A (167555) 

Prochlorococcus marinus str. AS 9601 (146891) 

Prochlorococcus marinus str. NATL2A (59920) 

Prochlorococcus marinus str. MIT 9312 (74546) 

Prochlorococcus marinus str. MIT 9515 (167542) 

Prochlorococcus marinus str. MIT 9215 (93060) 

Prochlorococcus marinus str. MIT 9301 (167546) 

Prochlorococcus marinus str. MIT 9211 (93059) 

Prochlorococcus marinus subsp. marinus str. CCMP1375 (167539) 

Microcystis aeruginosa NIES-843 (449447) 

Chlorobaculum parvum NCIB 8327 (517417) 

Magnetococcus sp. MC-1 (156889) 

Nitrosopumilus maritimus SCM1 (436308) 

Desulfurococcus kamchatkensis 1221n (490899) 

Sphingopyxis alaskensis RB2256 (317655) 

Hyphomonas neptunium ATCC 15444 (228405) 

Shewanella denitrificans OS217 (318161) 

Maricaulis maris MCS10 (394221) 

Erythrobacter litoralis HTCC2594 (314225) 

Marinobacter aquaeolei VT8 (351348) 

J annaschia sp. CCS1 (290400) 

Dinoroseobacter shibae DFL 12 (398580) 

Roseobacter denitrificans OCh 114 (375451) 

Marinomonas sp. MWYL1 (400668) 

Saccharophagus degradans 2-40 (203122) 

Pseudoalteromonas haloplanktis TAC125 (326442) 

Alcanivorax borkumensis SK2 (393595) 

Shewanella woodyi ATCC 51908 (392500) 

Shewanella loihica PV-4 (323850) 

Shewanella sp. MR^ (60480) 

Shewanella baltica OS155 (325240) 

Shewanella frigidimarina NCIMB 400 (318167) 

Nitrosococcus oceani ATCC 19707 (323261) 

Desulfovibrio vulgaris str. 'Miyazaki F' (883) 

Cyanothece sp. ATCC 51142 (43989) 

Chlorobium phaeobacteroides BS1 (331678) 

Acaryochloris marina MBIC11017 (329726) 

Prosthecochloris aestuarii DSM 271 (290512) 

Trichodesmium erythraeum IMS101 (203124) 

Gramella forsetii KT0803 (411154) 

Shewanella baltica OS223 (407976) 

Shewanella baltica OS195 (399599) 

Shewanella sp. MR-7 (60481) 

Vibrio parahaemolytjcus RIMD 2210633 (223926) 

Photobacterium profundum SS9 (298386) 

Vibrio fischeri MJ 11 (388396) 

Rhodopseudomonas palustris CGA009 (258594) 

Psychromonas ingrahamii 37 (357804) 



Figure 1 Presence of detoxification proteins in genomes of various marine bacteria. The Pfam profiles for all studied detoxification proteins 
were screened against 835 fully sequenced bacterial genomes. The data for the subset of bacterial species/strain that have been reported as 
marine are on display. The most commonly found marine bacteria, based on fragment recruitment to genome reference sequences [24], are 
indicated in blue. No gene found (grey), 1 gene per genome (black) and greater than 1 gene per genome (red). 
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very low and correspond very closely to the numbers in 

Candidatus Pelagibacter. 

RNA polymerase rpoB provides a good proxy for the 
number of genomes in the sample 

In order to set our data in relation to the number of 
genomes present we included in our analysis three 
control proteins with no clear link to detoxification but 
part of central cellular functions; RNA polymerase Rpb2 
domain 6 (corresponding to the rpoB gene), DNA 
polymerase A and translational elongation factor TS. 
The control genes were chosen to represent genes 
generally present in bacteria and mostly in one copy 
per bacterial genome (Table 1 and Additional file 4: 
Table S3). We observed that for our three control pro- 
teins, the RNA polymerase gene was most consistently 
found in one copy per genome for the fully sequenced 
bacteria and displayed the lowest variation (1.0 ± 0.1 
[±SD]). This consistency also holds if only marine ge- 
nomes, or only the most abundant bacterial species in 
the ocean, are considered (Additional file 4: Table S3). 
In addition, RNA polymerase appeared quite evenly 
distributed at all the different GOS sites and conse- 
quently exhibited a low coefficient of variation (25%). 
Hence, we have used the RNA polymerase gene to 
normalize the GOS data to the expected number of 
genomes per sample. 

Most detoxification proteins are under-represented at the 
GOS sites in relation to the expectation 

Two-dimensional hierarchical clustering was initially 
performed, after normalization to the number of ge- 
nomes (based on the RNA polymerase data), to visualize 
relations between detoxification systems as well as 
between sampling sites (Figure 2, top). Interestingly, we 
found that almost all investigated detoxification protein 
families are on average found less than once per genome 
in the GOS data (Figure 2, top), i.e. these genes are miss- 
ing in a large portion of the marine bacterial genomes. 
The exceptions were the major facilitator transporter, 
the ACR transporter, the penicillin binding transpepti- 
dase, and the peroxiredoxin families, that were mostly 
represented by one or more genes per genome. 

However, since some of these detoxification systems 
are part of commonly found paralogous families, we also 
compared our findings in the GOS data to what would 
be expected for the various detoxification systems in an 
average bacterial genome (using our data on the 835 
fully sequenced genomes). Interestingly, after this "para- 
log normalization" virtually all detoxification protein 
families were still underrepresented at the various GOS 
sites (Figure 2, bottom; Additional file 7: Figure S3). Fur- 
thermore, we found that the detoxification systems were 
distributed in two distinct groups. For example, all the 



metal resistance proteins, such as copper transporters 
and tellurite resistance proteins, belonged to a cluster 
containing markedly underrepresented detoxification 
components in marine bacteria (GOS) compared to an 
average bacterial genome. The second group included 
all investigated proteins related to the oxidative stress 
response, except for the superoxide dismutase family. 
Interestingly, we also found that in the most frequently 
encountered marine species (based on read recruitment 
[24], i.e. the genera Pelagibacter, Synechococcus, Pro- 
chlorococcus and Nitrosopumilus) there is a marked 
underrepresentation of certain detoxification systems 
(Figure 1). We conclude that in a majority of cases, 
abundance of the detoxification genes did not correlate 
with ecological success. 

Analysis of marine non-E coli detoxification systems 

To make sure we provide an exhaustive description of 
detoxification, we next extended our analysis to include 
six additional protein families, known to be involved 
in detoxification in marine bacteria [26-29], but not 
present in E. coli (Additional file 8: Table S4). Investiga- 
tion of these additional detoxification families revealed 
very similar results to the findings based on our initial 
set of 31 well-characterized E. co//-derived Pfam families: 
i) all six protein families were lowly represented in the 
GOS data (Additional file 8: Table S4), as well as in 
marine genomes (Figure 3a), ii) those that were abun- 
dant enough to generate distribution data, all showed up 
in less than one copy per genome at virtually all sites 
(Figure 3b), and iii) when compared to what would be 
expected from the sequenced bacterial genomes, these 
protein families showed the same pattern of marked 
underrepresentation (Figure 3c). We conclude that all 
data taken together strongly indicates that our results 
of general underrepresentation in marine bacteria will 
hold for most detoxification-related protein families. 

Geographical implications and anthropogenic influence 

Our initial hypothesis was that marine sampling sites 
close to the coast would be more exposed to anthropo- 
genic influence. Thus, we expected a greater repertoire 
and/or abundance of detoxification systems at these sites 
compared to sites from the open ocean. The GOS sam- 
pling sites are located in the Atlantic and the Pacific 
Oceans, and include samples taken from open ocean, 
coastal and estuarine habitats [30,31]. 

When normalized to the number of genes present in 
the average bacterial genome (Figure 2, bottom; note 
that the different sampling sites are color coded by geo- 
graphical location), sites expected to have similar envir- 
onmental conditions mostly clustered together based on 
their content of detoxification genes. For example, the 
three Sargasso Sea samples GS-OOb, c and d clustered 
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Normalized to 
detoxification genes in 
the average bacterial 
genome 




Figure 2 (See legend on next page.) 
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(See figure on previous page.) 

Figure 2 Distribution of detoxification protein families in the GOS data. Gene counts normalized to per genome-equivalents based on 
the occurrence of the RNA-polymerase gene (top), and gene counts normalized to the average detoxification gene content of 835 completely 
sequenced bacterial genomes (bottom). For site names, blue color indicates open ocean sites, red corresponds to coastal sites, and green 
represents estuaries and embayments. Protein families are color coded as follows; red - oxidative stress, blue - metal resistance, green - transporters, 
purple - other detoxification systems, grey - control proteins. 



tightly together. However, not all open ocean environ- 
ments clustered nicely with the Sargasso Sea samples. 
For example, the sites GS-22 and GS-37 seem to be 
atypical from the other open ocean samples, indicating 
that these sites have a different composition of detoxifi- 
cation proteins. Two of the most atypical environments, 
the high saline (GS-33) and fresh water (GS-20) samples, 
clustered separately from most of the other samples, 
while the two samples filtered to contain slightly larger 
microorganisms in the 0.8 to 3.0 um range (GS-Olb and 
GS-25) also clustered together. The latter two samples 
are expected to contain small single-cell eukaryotes as 
well as bacteria with large cell sizes. Similarly, the GS- 
00a site, that have been reported to be contaminated by 
Burkholderia [32] clustered separately from all other 
samples. We conclude that geographical location seemed 
to have an influence on the distribution of some, but far 
from all, of the detected detoxification systems in the 
metagenomic dataset. 

It was also clear from the clustering that certain de- 
toxification proteins were rather evenly distributed, 
while the appearance of others differed quite substan- 
tially between sites. This was in particular apparent 
when considering the overall variance over sampling 
sites for each protein (Figure 4); e.g. redoxin and the 
penicillin binding transpeptidase were roughly equally 
present at all sites (CV < 50%), while proteins like the 
C4-dicarboxylate transporter and arsenical pump mem- 
brane protein displayed a drastic variation in abundance 
depending on the sampling site (CV > 200%). However, 
for the highly variable proteins we also found a strong 
correlation to low abundance, indicating that for these 
proteins stochasticity plays a greater role and that the 
quantitative data from these sites should be viewed with 
caution. However, the majority of detoxification proteins 
displayed low CV values in the range 25 - 75% over the 
sampling sites, not exhibiting markedly greater variabil- 
ity than we observed for the three control proteins 
(RNA polymerase CV 25%; elongation factor CV 30%; 
DNA polymerase CV 50%). 

When we extended this analysis to compare the differ- 
ent habitats (open ocean, coastal and estuaries), we 
found that some of the detoxification systems were sig- 
nificantly more abundant at certain types of habitats, e.g. 
the beta-lactamase and the penicillin binding transpepti- 
dase families were both more abundant in the open 
ocean samples compared to the estuaries (p < 0.01; 



Figure 5a). Similarly, the peroxidases and thioredoxins 
were more common in the open ocean compared to 
coastal waters (p < 0.05). In addition, when we analyzed 
the detoxification genes in the three functional categor- 
ies metal resistance, transporters and oxidative stress 
together, it was apparent that oxidative stress genes were 
significantly more present in open ocean versus either 
coastal or estuary (p < 0.01 in both cases, data not 
shown). It is clear from these analyses, that some detoxi- 
fication systems were not equally distributed among 
marine sampling sites. However, we noted that the un- 
even distribution of some detoxification families was 
contrary to our initial expectation; in most cases the 
open ocean samples contained significantly higher levels 
of these detoxification proteins than the sites closer to 
land. Nevertheless, two metal resistance-related proteins, 
copper resistance protein CopC and the mercury resist- 
ance operon regulator MerR, showed significantly in- 
creased abundances in estuarine environments where we 
would expect the influence from human activity to be 
the greatest (Figure 5a). 

To further scrutinize the environmental distributions 
of detoxification systems, we used the geographical division 
based on estimated chemical composition at the various 
GOS sites proposed by Patel et al. (North Atlantic, Mid- 
Atlantic and Pacific samples) [33], and performed the 
same statistical analysis of relative abundance of detoxi- 
fication proteins as for habitats (Additional file 9: Figure 
S4). However, it should be noted that estimated chem- 
ical information is obtainable for only 29 of the GOS 
sites, excluding all the estuarine sites, which might be 
most severely impacted by human activities. Here, the 
divalent ion tolerance protein stood out as around four 
times more common in Mid- Atlantic and Pacific than 
in North Atlantic samples, and the oxidative stress re- 
sponse proteins peroxiredoxin and redoxin were more 
prevalent in Pacific samples than in the North Atlantic 
(p < 0.01 in all cases). In general, proteins that exhibited 
significant differences between North Atlantic, Mid- 
Atlantic and Pacific locations were more abundant in 
the Pacific and Mid- Atlantic samples, than in the North 
Atlantic. The notable exception to this was the multicopper 
oxidase family, which was significantly more abundant in 
both the Atlantic samples than in the Pacific (p < 0.01). 

The Patel et al. study also estimated the anthropogenic 
load on 29 of the samples investigated in our study [33]. 
We used their additional environmental variables together 
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Figure 3 Occurrence and distribution of six additional protein families in marine genomes and the GOS data. Occurrence of additional 
detoxification protein families not present in E. coll in fully sequenced and annotated marine genomes (a), their distribution in the GOS data 
normalized to per-genome equivalents (b), and normalized to the average detoxification gene content of 835 completely sequenced bacterial 
genomes (c). Bacterial names in blue color indicate species that are among the most commonly found in marine environments. 
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Figure 4 Variability of the detoxification proteins at the investigated marine sites. Variance expressed as coefficient of variation (CV) over 
all the sampled sites (right y-axis) in relation to the average number of reads belonging to that protein family (left y-axis). 



with the metadata from the GOS samples to make a princi- 
pal component analysis (Additional file 10: Figure S5). This 
analysis revealed that most detoxification proteins did not 
correlate with pollution and shipping, but rather with phys- 
ical properties such as temperature, salinity, water depth, 
and oxygen utilization. Finally, we separated the data into 
two sets, based on the estimated pollution at each site, pro- 
ducing a polluted and a non-polluted set of sampling sites 
(Additional file 11: Figure S6). It should be noted that esti- 
mated pollution correlated with shipping, but that high 
shipping levels did not always imply high pollution, and 
vice versa. Comparing the polluted and the non-polluted 
groups showed that proteins belonging to the multicopper 
oxidase and dioxygenase C families were significantly (p < 
0.01) more abundant at polluted sites (Figure 5b). 

Discussion 

Underrepresentation of well-characterized detoxification 
genes in marine environments 

The most striking finding of our metagenomic analysis 
of ocean samples is that there are surprisingly few de- 
toxification genes present in bacteria living in the marine 
environment compared to what is present in bacterial 
genomes in general. This statement is supported by 



several observations, i) The overall level of detoxification 
genes in sequenced genomes of marine bacteria compared 
to general bacterial genomes is low. This is particularly evi- 
dent for several general toxicant transporters, such as the 
ACR transporters and the major facilitator superfamily 
(Figure 1). In fact, in many of the marine bacteria most of 
the detoxification genes are missing completely, ii) The 
most abundant marine bacteria, and thus the ones clearly 
successful in the marine milieu, are missing a majority of 
the detoxification proteins, iii) There is a low level of de- 
toxification genes in the GOS metagenomic data, regardless 
of sampling site. The overall low number of detoxification 
systems is consistent with findings that most marine bac- 
teria lack a number of biological systems that are important 
or even essential in other environments [24]. These obser- 
vations have important implications for how we view 
detoxification proteins in general and their involvement in 
shaping bacterial ecology in particular. 

Many detoxification systems are part of a general 
core-set of genes that are present in most types of mi- 
croorganisms, and high representation of those should 
not immediately be viewed as their level of importance 
for detoxification. To handle this situation, we have nor- 
malized the data to the number of gene copies expected 
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Figure 5 Abundance of detoxification proteins between marine habitats, a) Comparison of the relative abundance of the detoxification proteins 
between the three major habitats open ocean (blue bars), coastal (red bars) and estuary (green bars), b) Comparison of relative abundance between 
non-polluted (blue) and polluted (red) sites. Bars indicate relative abundance in relation to the protein family abundance in the open ocean sites (a) or 
non-polluted sites (b), fixed to one, and only proteins that exhibited a significant difference in relative abundance between sites are displayed. Brackets 
indicate which comparisons that were significant; black brackets (p < 0.01) and grey brackets (p < 0.05). 
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to be found in general bacterial genomes. Detoxification 
genes are frequently present in multiple copies in microbial 
genomes, where gene multiplications are selected to yield 
high expression and/or functionally distinct paralogs to 
optimize survival and growth under harsh conditions. 
There are several examples of this, e.g. the drastically ele- 
vated copper tolerance of the European and Sake lineages 
of industrial yeast strains is related to high copy number for 
the gene encoding the copper binding protein Cupl [34], 
and cadmium tolerance in the cyanobacterium Synechococ- 
cus has been linked to gene amplification of the metallo- 
thionein smt [35]. Thus, it is currently widely accepted that 
gene duplication is one of the main genetic mechanisms 
shaping organisms during long-term adaptation. Our "para- 
log normalization" for gene content per genome should 
emphasize decreases and increases in gene number in 
relation to the expectations. 

It is interesting that we consistently find so low 
numbers of most detoxification proteins in the GOS 
data. This finding might seem contradictory to previous 
research on e.g. Alteromonas macleodii, showing pres- 
ence of heavy metal resistance genes [26]. However, the 
Alteromonas genus is not frequently found in open 
ocean surface water samples that are predominant in the 
GOS data set [24]. This means that the fact that species 
of this genus and other less common genera possess 
heavy metal resistance genes will not be contributing 
enough to the overall abundance of these genes at any 
specific GOS site. If a variety of such rare opportunists 
such as Alteromonas macleodii [36] are present in sur- 
face water, we would expect to find small numbers of a 
wide range of detoxification genes, which is indeed the 
case in the GOS samples (Figure 2). Additionally, the de- 
toxification genes in Alteromonas often seem to be located 
in genomic islands [26,36], indicating that they might be 
part of mobile elements. This raises the possibility that 
these genes are lowly abundant under normal conditions, 
but in certain circumstances they might be selected for, 
spread through a population, and consequently show a 
temporal and/or spatial burst in distribution. 

A strong contributing factor to the overall low abundance 
of detoxification genes at GOS sites is the high presence of 
Candidatus Pelagibacter species, which can constitute 
around a quarter of the bacteria present in ocean surface 
water [37]. In line with the extreme streamlining of the ge- 
nomes in this genus [25], we find that most detoxification 
systems are lowly abundant, or even completely absent in 
the Candidatus Pelagibacter ubique genome (Figure 1). 
The same pattern repeats also for the cyanobacterial genera 
Prochlorococcus and Synechococcus, which also possess 
streamlined genomes containing few detoxification genes - 
although to a lower degree than Pelagibacter, 

From our results, it is evident that we should not ex- 
pect the genomes of abundant marine bacteria that have 



not yet been sequenced to contain a wide arsenal of 
detoxification genes, as a wide distribution of such mi- 
crobes would have augmented the relative abundances 
of detoxification protein families far more than they are 
increased when normalized against marine bacteria in 
this study (Additional file 7: Figure S3). However, as 
there are still small numbers of many of those genes 
present in the GOS samples (Table 1), we would expect 
lowly abundant marine bacteria to harbor at least some 
of those families. 

We cannot exclude that there could be novel systems 
for detoxification in the ocean not present and/or not 
characterized in E. coll Groups of proteins with low se- 
quence similarity have been shown to still fold in similar 
three-dimensional conformations and to exhibit the 
same enzymatic activity [38]. This stresses the import- 
ance of functional metagenomics studies, where libraries 
of natural DNA are screened for specific enzymatic 
activities. Such analyses for novel sequence-function re- 
lations are important to perform on naturally occurring 
marine microbes, to get a complete understanding of the 
detoxification capacity/potential in the ocean. 

Oxidative stress - catalases and peroxidases in marine 
bacteria 

One important group of detoxification genes in the GOS 
data, which were found about as often as expected (com- 
pared to what is present in sequenced bacterial genomes), 
consisted of protein families involved in the oxidative stress 
response e.g. thioredoxins, glutathione peroxidases and 
peroxiredoxins (Figure 2; bottom). These proteins are also, 
in addition to being part of the oxidative stress defense, 
important in other aspects of cell physiology besides detoxi- 
fication. In addition, it is rational that these genes are 
abundant in marine environments, as many of the most 
common microorganisms in the ocean are photosynthetic 
and produce oxygen and oxygen radicals requiring efficient 
detoxification systems [39]. An interesting aspect of the 
oxidative stress proteins analyzed is that the catalase 
family seems to be almost absent in marine environ- 
ments (Table 1), while we found 1.6 times as many 
peroxidases than expected based on the sequenced ge- 
nomes (Additional file 7: Figure S3). This indicates that 
protection against H 2 0 2 could be conferred by peroxi- 
dases instead of catalases in dominant marine bacteria, as 
previously suggested by Bernroitner et al for the cyanobac- 
teria Synechococcus and Synechocystis [40]. The absence of 
typical metal-dependent catalases (katEs) in cyanobacteria 
has been proposed to reflect the absence of the need for 
diversification of catalases that coincided with the evolution 
of both eukaryotes and pathogenic bacteria, presumably to 
handle the high levels of H 2 0 2 produced by host defences 
[41]. In this model, bifunctional catalase-peroxidases 
(katGs) as well as peroxidases and peroxiredoxins would 
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provide sufficient protection against oxygen radicals pro- 
duced upon photosynthesis. In fact, katGs represent an 
evolutionary ancient protein family that lies at the evolu- 
tionary origin of a major superfamily of peroxidases and are 
thought to have been present in cyanobacteria already be- 
fore the evolution of oxygenic photosynthesis. In contrast 
to the GOS data, only two of the genomes of the most 
common marine bacteria contain peroxidases. A closer 
look at other peroxidase-active enzyme families, however, 
reveals that these marine bacteria contain more peroxire- 
doxins and glutathione peroxidases compared to the pre- 
dicted "standard" bacterial genome (Additional file 4: Table 
S3). This suggests that enzymes other than typical metal- 
dependent catalases protect cells from H 2 0 2 in the marine 
environment. Thus we conclude that peroxidases and/or 
peroxiredoxins appear to constitute the core cellular ROS 
defense mechanisms in the marine environment. 

The ecotoxicology of detoxification systems 

We found our original hypothesis - that detoxification 
systems would generally be more abundant close to the 
coast - not to hold, except in three cases; copper resistance 
protein CopC, mercury resistance operon repressor MerR 
and superoxide dismutase, which were significantly more 
common in estuaries (p < 0.05). On the contrary, for most 
of the proteins that displayed a significant geographical dis- 
tribution, like peroxidase, penicillin binding transpeptidase 
and divalent ion transport protein, the open ocean samples 
showed significantly higher abundance (Figure 5a). This is 
counter-intuitive since human activity should have a higher 
impact on waters close to the coast. The estimated an- 
thropogenic load [33] was, unfortunately, not available for 
any of the estuarine samples. However, salinity can be used 
as a proxy for the impact from river outflow of fresh water 
into the sea. The coastal sites clearly exhibit a higher impact 
from fresh water rivers indicated by its substantially lower 
salinities (average salinity coastal 32.0%o (±2.8), and open 
ocean 35.7%o (±1.9)), even if there is generally quite some 
distance to land for the coastal sites. Thus, we would expect 
the coastal sites to be more polluted compared to samples 
from the open ocean. However, the chemical data for open 
ocean and coastal samples by Patel et al, [33] indicates that 
there was not systematically higher pollution levels in the 
coastal environments (data not shown), which also is in line 
with our observations of detoxification systems. Interest- 
ingly, the multicopper oxidase and dioxygenase C families 
were the only protein families strongly related to the esti- 
mated pollution (Figure 5b). Overall, we observed that 
macrogeographic distribution might be more influential to 
the detoxification gene content than human-made pollut- 
ants (Figure 5 and Additional file 9: Figure S4), providing 
further indication that natural environmental factors have 
shaped the marine bacterial communities to a much larger 
extent than anthropogenic influence. These observations 



highlight the non-trivial relationships between pollu- 
tant selection pressure and how it shapes microbial 
communities in different environments. It is clear that 
to better understand these phenomena, future studies 
need to collect appropriate environmental data con- 
nected to the samples. 

It should be stressed that many of these detoxification 
systems might not only be of use in handling man-made 
pollutants, but could equally well be involved in the de- 
toxification of natural toxins in the open ocean. Several 
biotoxins are for instance produced by dinoflagellates 
and cyanobacteria [42,43]. Marine bacteria like Synecho- 
cystis and Synechococcus can produce a number of toxic 
compounds and the increase in nutrients due to human 
activities, and consequently the increase of eutrophica- 
tion, could enhance the possibility of harmful cyanobac- 
terial bloom formation and result in higher levels of 
toxins. The effect of these marine biotoxins on the 
bacterial communities in the open ocean is presently not 
known. However, it has been proposed that marine bac- 
teria, such as Vibrio and Pseudomonas spp., are capable 
of metabolizing biotoxins via oxidation reactions cata- 
lyzed by oxidases and peroxidases [44] . 

The large abundance of cyanobacteria could also 
explain the comparably high numbers of penicillin bin- 
ding transpeptidases in the oceanic environment. Trans- 
peptidases are present in cyanobacteria and are involved 
in biosynthesis and maintenance of bacterial peptidogly- 
can, required for proper formation of the cyanobacterial 
cell wall [45]. The peptidoglycan layer is also believed to 
be partially responsible for drought tolerance in cyano- 
bacteria. Cyanobacterial transpeptidases have been 
suggested as a possible origin of beta-lactamase genes, 
providing resistance to e.g. penicillin [46], and have been 
found in e.g. most Prochlorococcus strains. It is difficult 
to address whether the identified penicillin binding 
transpeptidase sequences constitute a cell wall main- 
tenance system and/or a detoxification system, as the 
sequence similarity between the two categories is high 
and the physiological mechanisms are not completely 
understood, especially in non-model organisms. 

Conclusions 

We conclude that the ocean as a habitat poses severe re- 
strictions to fitness by several environmental factors e.g. 
nutrient limitation and pH, while the selection pressure 
from toxicants in the marine environment seems to be 
comparably small. It is, however, important to note that 
the ocean does not only function as a habitat, but is also 
a huge dispersal matrix connecting habitable patches, 
also for bacterial species. Most of these opportunistic 
bacteria present in small quantities need to have a larger 
set of genes, as they not only cope with the harsh reality 
of the ocean, but also must be able to bloom or colonize 
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once they find a suitable habitat. However, we here re- 
port that the majority of marine bacteria in the open 
ocean do not possess a large variety of detoxification 
systems and would be predicted to have a lower capacity 
to adapt to polluted environments. It will be interesting 
to, in the future, examine specific highly polluted marine 
sites over longer time-periods, as well as to functionally 
characterize novel detoxification systems in marine 
bacteria. This will be essential to provide good estimates 
of future anthropogenic consequences on microbial 
life in the sea and could provide crucial information as 
ecotoxicology continues to move into metagenomic 
community analysis. 

Methods 

Protein family selection 

The Gene Ontology annotation [14] was used to select 
genes from the NCBI database related to detoxification 
mechanisms. The following nine GO-terms was used to 
extract genes: response to organic substance (0010033), 
response to molecule of fungal origin (0002238), response 
to molecule of bacterial origin (0002237), response to xeno- 
biotic stimulus (0009410), response to antibiotic (0046677), 
response to drug (0042493), response to inorganic sub- 
stance (0010035), response to toxin (0009636), and re- 
sponse to oxidative stress (0006979). To restrict this set to 
well characterized bacterial genes, only genes belonging to 
one of these categories that could also be found in Escheri- 
chia coli were selected, and their corresponding protein se- 
quences were retrieved. To determine the Pfam (release 24) 
[12] protein domains present in these protein sequences, all 
sequences were scanned against the Pfam database of 
profile-HMMs using HMMER3 [15] This resulted in a list 
of 159 protein families (Additional file 2: Table S2) that 
was manually inspected and reduced into a list of 31 
detoxification-related protein families (Table 1). In addition, 
the profile for RNA polymerase Rpb2, domain 6 (PF00562) 
was added to provide normalization information, and the 
profiles for DNA polymerase A (PF00476) and elongation 
factor TS (PF00889) were also added to the set for compari- 
son to ubiquitously occurring protein families. 

Data mining and abundance data 

The profile -HMM for each protein family in Table 1 was 
compared to the 6,028,191 proteins predicted from the 
GOS data set [19] using hmmsearch (part of HMMER3), 
applying the trusted cut-off threshold specified for each 
individual Pfam profile-HMM. Samples taken with dif- 
ferent filter sizes (GS-Ola, GS-Olb and GS-25), as well 
as the sample indicated to contain post-sampling con- 
tamination (GS-OOa) [32], were not excluded, as these 
samples are likely to show pronounced differences from 
the majority of the samples. The resulting number of se- 
quences can be found under "GOS Initial" in Table 1. 



Sequences producing significant matches were extracted, 
and scanned against the entire Pfam profile database. 
Only reciprocal hits, i.e. sequences that found the same 
profile as initially found them, were kept ("GOS Recipro- 
cal" in Table 1). There does not seem to be a significant 
correlation between the profile lengths and the number 
of sequences found in the GOS data (Additional file 12: 
Figure S7). The absence of such a trend is important 
since that could be a potential source of positive bias for 
short Pfam profiles because of the fragmentary and short 
sequence nature of metagenomic sequence data. The 
same Pfam profiles and procedure were used to deter- 
mine the number of detoxification genes in 835 se- 
quenced and annotated bacterial genomes (Additional 
file 3). The average number of genes corresponding to 
each protein family can be found in Additional file 4: 
Table S3. 

Environmental distribution data 

Environmental distribution data was retrieved using the 
GOS nucleotide read databases and their connected meta- 
data [30], downloaded from CAMERA [31]. Additional 
estimates of chemical parameters and anthropogenic in- 
fluence on 29 of the sample sites was derived from previ- 
ous work with the GOS samples [33]. Only reads that 
overlapped with the ORF coding for a protein of interest 
were included in the sampling site data search. At each 
GOS sampling site, the reads reciprocally matching a 
certain protein family were counted. To allow for per 
genome estimates for the protein families, RNA polyme- 
rase Rpb2, domain 6 (PF00562), corresponding to the 
rpoB gene, was used for normalization. This protein family 
is almost ubiquitously occurring in a single copy in the 
sequenced bacterial genomes (Additional file 3; Additional 
file 4: Table S3; Additional file 5: Figure SI), making 
division by the number of found proteins containing a 
Pfam RNA_pol_Rpb2_6 domain a suitable normalization 
procedure. The rpoB normalized values were correlated to 
environmental data [33] and used for the principal com- 
ponent analysis (Additional file 10: Figure S5). Normalized 
values were then converted to log 2 -scale. Sites with more 
than three protein families finding no reads, and protein 
families with more than three sites containing no reads 
were removed. A small number of remaining zeros that 
could not be translated to log-scale were set to -10, and a 
heat map (Figure 2, top) was created using R [47], with 
the additional components marray [48], amap [49] and 
gplots [50]. Hierarchical clustering was done using Eucli- 
dean distances for both sampling sites and protein 
families. A similar heat map was generated for the gene 
content of the sequenced bacterial genomes (Additional 
file 5: Figure SI). However, these genome values were not 
normalized to the RNA-polymerase (PF00562) protein 
family, as that would have severely skewed the data in 
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Additional file 9: Figure S4. Abundance of detoxification proteins 
across macrogeographic locations. Comparison of the relative abundance 
of the detoxification proteins between the three geographic locations 
suggested by Patel et al. [33]: North-Atlantic (blue bars), Mid-Atlantic 
(red bars) and Pacific (green bars). Bars indicate relative abundance in 
relation to the protein family abundance in the North Atlantic sites (fixed 
to one) and only proteins that exhibited a significant difference in relative 
abundance between locations are displayed. Brackets indicate which 
comparisons that were significant; black brackets (p < 0.01) and grey 
brackets (p < 0.05). 

Additional file 10: Figure S5. Principal component analysis of 
correlated protein families and environmental features. Protein families 
are color coded according to category: red - oxidative stress, blue - metal 
resistance, green - transporters, purple - other detoxification systems, 
grey - control proteins. Environmental features are represented in black. 

Additional file 11: Figure S6. Sample sites clustered by estimated 
pollution and shipping. The sites classified as polluted are marked by 
red. All other sites were classified as non-polluted (blue). Pollution and 
shipping data were estimated by Patel et al. [33], 

Additional file 12: Figure S7. Number of sequences matching to each 
of the initial 159 Pfam profile-HMMs. Numbers are plotted against the 
length of the Pfam profile. No significant correlation between length and 
matched sequences was observed. 

Additional file 13: Figure S8. Distribution of detoxification protein 
families in the metagenomic GOS data. Normalized to the average gene 
content of all 61 marine species/strains (as listed in Figure 1) included in 
our survey of 835 bacterial genomes. 

Additional file 14: Excel file containing the results of the protein 
family searches from the GOS sampling sites. 



some cases. The detoxification gene counts for the bacte- 
rial genomes were used to normalize the GOS data and 
generate a heat map showing the abundance of the detoxi- 
fication protein families compared to what would be 
expected from the bacterial genomes (Figure 2, bottom). 
In addition, a heat map consisting of only marine bacteria 
was constructed (Figure 1). Bacterial species were classi- 
fied as marine depending on their presence in a recent 
study by Yooseph et al [24], which analyzed the genomes 
of commonly found bacteria in the GOS data set. Finally, 
each abundance value from the GOS data was divided by 
the average count for the corresponding protein family in 
the marine bacterial genomes (Additional file 4: Table S3). 
The numbers produced were used to create a heat map 
showing the over- and underrepresentation relative to 
what would be expected from the marine bacterial 
genomes (Additional file 13: Figure S8). In cases where zero 
instances of a protein family were found in the genomes, 
GOS data was instead normalized using the value 0.01 (data 
for all protein families from each step can be found in 
Additional file 14). 

To test whether our set of protein families was rep- 
resentative of marine bacterial detoxification systems in 
general, we subjected six detoxification families known 
to be involved in detoxification, but not included in our 
R co//-derived data set (Additional file 8: Table S4), to 
the same procedure with HMMER searches and 
normalization as above (Figure 3). 

Availability of supporting data 

The data sets supporting the results of this article are 
included within the article and its additional files. 
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